Protein folding dynamics via quantification of kinematic energy landscape 
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We study folding dynamics of protein-like sequences on square lattice using physical move set 
that exhausts all possible conformational changes. By analytically solving the master equation, we 
follow the time-dependent probabilities of occupancy of all 802,075 conformations of 16-mers over 7- 
orders of time span. We find that (i) folding rates of these protein-like sequences of same length can 
differ by 4-orders of magnitude, (ii) folding rates of sequences of the same conformation can differ 
by a factor of 190, and (iii) parameters of the native structures, designability, and thermodynamic 
properties are weak predictors of the folding rates, rather, basin analysis of the kinematic energy 
landscape defined by the moves can provide excellent account of the observed folding rates. 



The dynamics of protein folding has been studied ex- 
tensively 00. A remarkable observation is that protein 
folding rates are well correlated with their native struc- 
tural properties . A native centric view postulates that 
protein folding rates are largely determined by the topol- 
ogy of its native structure |3j • Theoretical models using 
Go potential where only native contacts contribute en- 
ergetically are very successful in reproducing observed 
folding rates 

00 

However, the extent to which native structure deter- 
mines folding rate remains unclear. By the native-centric 
view, different sequences for the same protein struc- 
tural fold would all have very similar folding rates, as 
they share essentially the same native structure topol- 
ogy. However, this is not consistent with experimental 
results. As the folding rates of simple single-domain pro- 
teins differ by 6 orders of magnitude 0, folding rates 
may be very heterogeneous. A recent experimental study 
showed that a designed artificial protein with no homol- 
ogous sequence in nature that adopts the same structure 
as a natural protein can fold 4,000 times faster 0. A 
distinct possibility is that the empirical correlation be- 
tween properties of native protein structures and folding 
rates may arise from inadequate sampling in the sequence 
space due to accumulated biased natural selection and 
limited genetic drift, rather than from intrinsic physical 
properties of proteins. 

In this letter, we use two-dimensional hydrophobic and 
polar (HP) lattice model jf| to study the relationship of 
folding rates, native structure topology, thermodynamic 
properties, and effects of sequence variation. We model 
the physical movement of protein chains. Real protein 
cannot immediately jump from one conformation to an- 
other arbitrary conformation. Two conformations of the 
same energy may be well separated kinetically. We re- 
gard protein movement as a sequence of successive con- 
formational changes, each represented by a physically re- 
alizable primitive move. The physical move set we devel- 
oped exhausts all possible conformational changes for a 
structure. We use master equation to study the folding 
dynamics of foldable sequences of length 16. While mas- 
ter equation provides an exact solution , in the past 



it was necessary to cluster conformations of larger sys- 
tems into macrostates to reduce the size of the transition 
matrix |8(, therefore making the use of physical moves 
infeasible. Here we directly solve the eigenvalue problem 
of the 802,075 x 802,075 transition matrix and develop 
a method to monitor the time-dependent probability of 
occupancy of all conformations simultaneously from the 
first kinetic move until reaching half-equilibrium concen- 
tration over 7-orders of time scale. 

Our results show that the properties of native struc- 
ture, designability, and thermodynamic properties are 
inadequate to explain protein folding dynamics in our 
model systems. We found that protein-like sequences can 
fold into the same native structure with folding rates dif- 
fer as much as 190 times and sequences of the same length 
and energy gap can differ by 4-orders of magnitude in 
folding rate. Instead of thermodynamic properties, we 
show that properties of the move-connected energy land- 
scape defined by the connection graph of physical moves 
can provide excellent account for observed folding rates. 

Model. We use the following energy model for dif- 
ferent types of nonbonded HP contacts: Ehh = 1, 
Ehp = 0, and Epp = 0. By evaluating the energy 
level of all 2 16 sequences of 16-mers on all enumerated 
= 802,075 conformations, we have identified 26 se- 
quences that all fold into the same ground state confor- 
mation (Fig.^l. This set of sequences forms the largest 
protein family, where each sequence adopts the same con- 
formation, and all are connected by (a series of) point 
mutations. Altogether, there are 1,539 foldable sequences 
with unique ground state conformations. There are 456 
conformations that are the unique ground state for 1 or 
more foldable sequences. 

We develop a set of physically possible primitive moves 
(Fig. They are generalizations of corner move, 

crankshaft move, and pivot move. We exhaust all possi- 
ble occurrence of such moves for every conformation. We 
verified that this move set is ergodic, i.e., all conforma- 
tions are connected to each other by a series of primitive 
moves. With this move set, the simple energy scheme 
of the HP model leads to a complex energy landscapes, 
with numerous local minima for a foldable sequence. 
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infeasible, Ai and the corresponding eigenvectors n\ and 
V\ can be computed by an Arnoldi method. 



FIG. 1: Protein-like sequences and the set of primitive moves. 
The largest protein family contains (a) 26 sequences that all fold 
into (b) the same structure. Here filled circles are H residues, (c) 
The move set includes: among (1, 2, and 3), single point moves 
rotate around a single point; between (1 and 4), generalized 
corner moves reflect around a diagonal axis connecting any two 
residues; between (1 and 5), generalized crankshaft moves reflect 
around a horizontal or vertical axis. Points of rotation are on gray 
background. For a given conformation, we exhaustively search all 
possible position for point moves, all possible pairs of positions 
for possible generalized corner moves and generalized crankshaft 
moves. 

We use Metropolis-type of dynamics to assign the tran- 
sition rate r^- from conformation i to a neighbor con- 
formation j connected by a move: = 1 if E(J) < 
E(i); r« = e -[E(j)-E(i)]/T if E ^ > and r .. = 

~Yli^k r iki if j = For non- neighbors, r<j = 0. We 
assume the effects of viscosity and friction are negligible. 

We follow 0, @ and use a master equation to study 
protein folding dynamics. Let Pi{t) be the probability 
that the HP molecule takes the i-th conformation at the 
timet, then dpi(t) /dt = Y^i^j[ r ]iPji t )~ r ijPi{ t )]- Written 
in vector form, we have: dp{t)/dt — Rp(t), where R is 
the rate matrix whose entries arc defined by the above 
expression. We choose temperature T = 0.2 in unit of 
AEhh /ks, which is below the folding temperature Tf 
when 50% of molecules take the native conformation. Tf 
varies from ~ 0.2 to ~ 0.5 for different sequences. 

A general solution of the master equation can be writ- 
ten as pit) = J2i^i n i e ~ Xit wrtn Ci = vfp(0), where 
Ai is the i-the eigenvalue of the rate matrix R, rii the 
corresponding right eigenvector, Vi the left eigenvector, 
and p(0) the initial vector of distribution of conforma- 
tions. In this study, we use the high temperature con- 
dition and assign p(0) = Two eigenvalues are of 
particular interest: Ao = corresponds to the equilib- 
rium Boltzmann distribution, and the smallest none-zero 
eigenvalue Ai determines the slowest mode of relaxation. 
Following yj, we take Ai as the folding rate kf of the 
protein. Although the full computation of all eigenvalues 
and eigenvectors for a 802,075 x 802,075 matrix M is 



Contact energy 

FIG. 2: The correlation of logfc/ and ground state contact en- 
ergy. A circle represents one of the 26 sequences shown in Fig.0 
a cross represents one of the 79 singleton sequences, and the di- 
amond represents the Go model. Native conformations for a few 
sequences are also shown. 

Thermodynamics and folding rates. Several ther- 
modynamic properties have been proposed to be deter- 
minants of protein folding rates. We found that protein 
stability as measured by the total contact energy are cor- 
related with logfc/ (R 2 — 0.71), i.e., more stable proteins 
fold slower in general (Fig. EJ. Because stable proteins 
have lower ground state energy, some local minima will 
also have relatively deep energy traps. As a result, more 
stable proteins will have slower folding rates because they 
can be trapped in such local minima. However, the fold- 
ing rates of sequences of the same ground state energy 
can still differ as much as 10 4 . The heterogeneity of 
folding rate was already noted in an earlier study using 
macrostate approximation Here we found that even 
sequences that fold into the same conformation shown in 
Fig-Hi demonstrate a wide range of rates, from 1.1 x 10 -3 
to 5.8 x 10 -6 , which is much larger than the difference be- 
tween the average folding rates for sequences of different 
native state energies. Protein stability therefore provide 
some but not the main explanation of the heterogeneity 
of folding rates. 

Energy gap between ground state and excited state was 
thought to be the necessary and sufficient determinant 
of folding rate 0. For all 1,456 protein- like sequences 
of N = 16, the energy gap between the lowest state and 
the next state is AE = 1. The diversity in folding rate 
kf shown in Fig. [21 clearly indicates that energy gap is 
not a determining factor for the folding rate. The corre- 
lation R 2 between log kf and energy gap normalized by 
standard deviation is 0.01. 

Another thermodynamic property thought to be an im- 
portant determinant of folding rates is the collapse coop- 
erativitycr = l-T f /T e 10], where Tf is as defined earlier, 
and Tg the temperature when heat capacity C{T) reaches 
its maximum. Fig. [21 shows that for the 26 sequences 
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that fold to the same native structure in Fig. QJd, there 
is a weak correlation (R 2 — 0.38) between collapse coop- 
erativity and log k f . Large variance in observed folding 
rates exist for sequences of similar collapse cooperativity. 
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FIG. 3: Examples of the correlation of folding rate kf with 
thermodynamic properties and kinetic landscape properties, (a) 
kf and collapse cooperativity a have weak correlation (R 2 = 
0.38). (b) kf has excellent correlation with the number of local 
minima (R 2 = 0.85), a property of the kinetic landscape. The 
diamond represents the Go model. 

The number of sequences that take a specific confor- 
mation as the unique ground state is thought to be corre- 
lated with overall protein stability and folding rates [ll| . 
We calculated in addition kf for a group of 79 single- 
ton sequences with no sequence homologs that fold to 
the same native conformations. The distribution of fc/s 
for the singleton sequences and the 26 sequences shown 
in Fig. [21 demonstrate similarly large variation. For our 
model, designability is not an important determinant of 
the folding rates. 

The Inverse Participation Ratio / is commonly used to 
characterize the localization of eigenvectors. It is defined 
as I = J2 k vl, where v k is the fc-th coefficient of the 
normalized eigenvector. The correlation between I for 
the equlibrium eigenvector and the folding rate for the 
26 sequences is rather poor (R 2 = 2 x 10~ 3 ). 

Kinematic determinants of folding landscape. 
Protein folding kinetics are intrinsically determined by 
physical movement of molecules. Weak correlations of 
the folding rate with thermodynamic properties are not 
surprising. Thermodynamic properties of a sequence can 
be calculated if its complete set of conformations are enu- 
merated. Such properties are not affected by the kinetic 
connections between conformations. A smooth energy 
landscape ensuring fast folding can be easily permuted 
into a rugged landscape by assuming different transi- 
tion rules between conformations. Both will have the 
same thermodynamic properties, but the resulting fold- 
ing rates for the same sequence will be very different. The 
energy landscape of folding is dictated by the connection 
graph of states defined by the move set. Characterizing 
such kinematic energy landscape is therefore essential for 
studying protein folding dynamics. 

Although the energy landscape contains 802,075 con- 
formations, each is connected by the move set to only a 



limited number (~ 30) of conformations. We can identify 
states that are local minima, i.e., all states connected to 
which by moves have higher energy. A simple characteri- 
zation of the kinematic energy landscape is then the num- 
ber count n m i n of the local minima. Fig. [3t> shows that 
an excellent correlation of log kf and n min (R 2 = 0.85) 
can be found for the 26 HP sequences that fold into the 
same conformation. 

Our conclusions are not sensitive to temperature T. 
When T is raised from 0.20 to 0.21 (equivalent to raising 
T from 300.Fr to 315-fT), we found that the folding rate 
kf of the 26 sequences all increases. Although kf for slow 
folder increases more (by a factor of 2.0 versus a factor 
of 1.4 for fast folders), kf at T = 0.21 is well-correlated 
with kf at T = 0.20. The correlation coefficients of log kf 
with the number of local minima, collapse cooperativity 
(Fig.|3J), and other thermodynamic parameters are essen- 
tially unchanged. 

Time evolution and basin analysis. Monitoring 
the exact time evolution of all conformations simulta- 
neously until reaching equilibrium during folding is a 
challenging task. Mathematically, the model of master 
equation is equivalent to a Markov process, where the 
population vector of conformations at time t + kAt is 
given by p(t + kAt) = M k p(t), where M = I + R ■ At, 
I being the identity matrix. However, the /c-time step 
Markov matrix M k rapidly becomes a dense matrix, 
and following the time evolution of folding by a straight- 
forward matrix multiplication of C(|0| 3 log /c) steps be- 
comes impossible for a large matrix of size = 802, 075 
and k — 10 6 — 10 10 . The analytical solution of p(t) = 
J2i CiTiie~ Xit through diagonalization is also impracti- 
cal, as it is only possible to calculate a few eigenvectors 
and eigenvalues for a large matrix. 

We seek an accurate solution without the approxima- 
tion of macrostates. Taking advantage of the sparsity 
of the rate matrix R, we follow the approach of Sidje 
|l2| and use the analytical solution of matrix exponen- 
tial: pit) — e Rt p(0), where e Rt is defined by the Taylor 

expansion e Rt = I + tR+ l ^R 2 H V ^R k + ■ ■ ■ . This 

expansion itself is impractical, as it also involves large 
matrix product of increasing density. Plus, the entries in 
the matrix terms may have alternating signs and hence 
cause numerical instability. A better approach is to ex- 
pand e Rt p(0) in the Krylov subspace K m defined as: 

K m (Rt,p(0)) = Span{p(0), • • ■ , {Rt)™- 1 p{Q)} . (1) 

Denoting || ■ ||2 as the 2-norm of a vector or 
matrix, our approximation then becomes p(t) w 

||p(0)||2V m +ie t -^ m+1 ei, where e\ is the first unit ba- 
sis vector, V m +i is a (m + 1) x (to + 1) matrix 
formed by the orthonormal basis of the Krylov sub- 
space, and H m+ i the upper Heisenberg matrix, both 
computed from an Arnoldi algorithm. The error can 
be bounded by 0(e m -^ R ^(t\\R\\ 2 /m) m ). We now 
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only need to compute explicitly e m+1 . Because m 
is much smaller than 802,075, this is a simpler prob- 
lem. A special form of the Pade rational of poly- 
nomials instead of Taylor expansion is used for this 
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Fig. 0| shows 
an example of 
an HP sequence 
(sequence C in 
Fig. QJi) and the 
time evolution 
of its native 
conformation 
and several local 
minima confor- 
mations. The 
time evolution 
of the native 
conformation 
shows an initial 
fast phase upto 
t ~ 50 time 
units. In prin- 
ciple, the local 
minima con- 
formations can 
follow different 
kinetic processes: 



-i) = I2l=o Ck(tH m+ i) k and c k 
1 r: our calculation, we select m — 30. 
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FIG. 4: The time evolution of the native 
state and several local minima states. 
The probability of occupation of na- 
tive state conformation (top) increases 
monotonically through a time span of 
10~ 2 — 10 , but local minima conforma- 
tions go through transiently accumulat- 
ing intermediate states. 



Some could be transiently accumu- 
lating, and others either monotopically accumulating 
or monotopically decreasing. Based on the computed 
trajectories of time evolution, we find that the dynamic 
behavior of local minima conformations can be predicted 
from basin analysis of the move-connected energy land- 
scape. We define the size of the basin associated with 
each local minimum state i computationally by artifi- 
cially making every local minimum an absorption state, 
i.e., a sink of infinite depth, such that once reached, 
no molecule can escape. This is achieved by assigning 
Tin = and rn — 1 for each local minimum state i 
|13| . p[(t = oo) therefore reflects the size of the basin 
of the z-th local minimum. We define the accumulation 



r. If g > 1, state i is most 



ratio as g = e _ E . /T/£ £ - Ej /t - 

likely a transient accumulating state, i.e., the other 
conformations in its basin first rapidly flow to state i 
before transiting to conformations outside the basin. If 
g < 1, depending on its initial probability of occupancy 
and the final Boltzmann factor, state i may be cither 
a monotonically decaying or accumulating state. We 
find that among the 493 local minima states for this 
sequence, all except 3 are transiently accumulating, 
indicating they are responsible for forming transient 
state ensemble of various time scale. 



To understand whether the formation of certain native 
contacts facilitate folding, we examine the time evolu- 
tion of each of the 8 native contacts (a-h) in Fig. ^b) 
for the 26 sequences. We found that fast folders have 
larger amount native contact d (R 2 — 0.74 — 0.81 with 
log/c/), and contact c at the transient time of50 — 100 
(Fig. 0}, indicating that these contacts are critical for 
folding by restricting favorably the conforamtional search 
space. The formation of other native contacts seem not 
to be directly related to folding rates. 

To conclude, we studied protein folding dynamics using 
a model based on detailed physical moves and exact solu- 
tion of the master equation. We found that folding rates 
vary enormously for sequences of the same length, energy, 
energy gap, and even of the same ground state confor- 
mation. In contrast to the thermodynamic parameters 
which are weak predictors of folding rates, properties of 
the kinematic landscape defined by the physical moves 
provide excellent correlation with folding rates. With the 
computation of time evolution of individual conformation 
from the first move to half-time of equilibrium, we show 
that many transiently accumulating intermediate states 
can be identified by basin analysis. 
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